guided meta-policy search
Guided Meta-Policy Search
Reinforcement learning (RL) algorithms have demonstrated promising results on complex tasks, yet often require impractical numbers of samples because they learn from scratch. Meta-RL aims to address this challenge by leveraging experience from previous tasks so as to more quickly solve new tasks. However, in practice, these algorithms generally also require large amounts of on-policy experience during the \emph{meta-training} process, making them impractical for use in many problems. To this end, we propose to learn a reinforcement learning procedure in a federated way, where individual off-policy learners can solve the individual meta-training tasks, and then consolidate these solutions into a single meta-learner. Since the central meta-learner learns by imitating the solutions to the individual tasks, it can accommodate either the standard meta-RL problem setting, or a hybrid setting where some or all tasks are provided with example demonstrations. The former results in an approach that can leverage policies learned for previous tasks without significant amounts of on-policy data during meta-training, whereas the latter is particularly useful in cases where demonstrations are easy for a person to provide. Across a number of continuous control meta-RL problems, we demonstrate significant improvements in meta-RL sample efficiency in comparison to prior work as well as the ability to scale to domains with visual observations.
Reviews: Guided Meta-Policy Search
The proposed method is a novel (and elegant) combination of existing techniques from off-policy learning, imitation learning, guided policy search, and meta-learning. The resulting algorithm is new and I believe it can be valuable to researchers in this field. I think the paper does not sufficiently discuss the related work of Rakelly et al [1] (PEARL). The paper is cited in the related work section as one of the methods based on "recurrent or recursive neural networks". If I remember correctly they don't use recurrency (only in one of their ablation studies to show that their method *outperforms* a recurrent-based encoder).
Reviews: Guided Meta-Policy Search
This paper has a valuable contribution to meta-learning and greatly improves the state-of-the-art. The proposed idea that explicitly separates the meta-learning algorithm and meta-objective learning into two distinct phases is significantly effective as shown in experiments. The idea is quite original compared to the current trend of unified the deep learning black box.Theoretical convergence analysis shows the proposed method significantly improved sample efficiency compared to some reference examples. The authors addressed a comparison to PEARL, adequately and the additional insight will strengthen the paper a lot. Given that the rebuttal was strong and the analysis well made, I found that the paper be accepted.
Guided Meta-Policy Search
Reinforcement learning (RL) algorithms have demonstrated promising results on complex tasks, yet often require impractical numbers of samples because they learn from scratch. Meta-RL aims to address this challenge by leveraging experience from previous tasks so as to more quickly solve new tasks. However, in practice, these algorithms generally also require large amounts of on-policy experience during the \emph{meta-training} process, making them impractical for use in many problems. To this end, we propose to learn a reinforcement learning procedure in a federated way, where individual off-policy learners can solve the individual meta-training tasks, and then consolidate these solutions into a single meta-learner. Since the central meta-learner learns by imitating the solutions to the individual tasks, it can accommodate either the standard meta-RL problem setting, or a hybrid setting where some or all tasks are provided with example demonstrations. The former results in an approach that can leverage policies learned for previous tasks without significant amounts of on-policy data during meta-training, whereas the latter is particularly useful in cases where demonstrations are easy for a person to provide.
Guided Meta-Policy Search
Mendonca, Russell, Gupta, Abhishek, Kralev, Rosen, Abbeel, Pieter, Levine, Sergey, Finn, Chelsea
Reinforcement learning (RL) algorithms have demonstrated promising results on complex tasks, yet often require impractical numbers of samples because they learn from scratch. Meta-RL aims to address this challenge by leveraging experience from previous tasks so as to more quickly solve new tasks. However, in practice, these algorithms generally also require large amounts of on-policy experience during the \emph{meta-training} process, making them impractical for use in many problems. To this end, we propose to learn a reinforcement learning procedure in a federated way, where individual off-policy learners can solve the individual meta-training tasks, and then consolidate these solutions into a single meta-learner. Since the central meta-learner learns by imitating the solutions to the individual tasks, it can accommodate either the standard meta-RL problem setting, or a hybrid setting where some or all tasks are provided with example demonstrations.
Guided Meta-Policy Search
Mendonca, Russell, Gupta, Abhishek, Kralev, Rosen, Abbeel, Pieter, Levine, Sergey, Finn, Chelsea
Reinforcement learning (RL) algorithms have demonstrated promising results on complex tasks, yet often require impractical numbers of samples because they learn from scratch. Meta-RL aims to address this challenge by leveraging experience from previous tasks in order to more quickly solve new tasks. However, in practice, these algorithms generally also require large amounts of on-policy experience during the meta-training process, making them impractical for use in many problems. To this end, we propose to learn a reinforcement learning procedure through imitation of expert policies that solve previously-seen tasks. This involves a nested optimization, with RL in the inner loop and supervised imitation learning in the outer loop. Because the outer loop imitation learning can be done with off-policy data, we can achieve significant gains in meta-learning sample efficiency. In this paper, we show how this general idea can be used both for meta-reinforcement learning and for learning fast RL procedures from multi-task demonstration data. The former results in an approach that can leverage policies learned for previous tasks without significant amounts of on-policy data during meta-training, whereas the latter is particularly useful in cases where demonstrations are easy for a person to provide. Across a number of continuous control meta-RL problems, we demonstrate significant improvements in meta-RL sample efficiency in comparison to prior work as well as the ability to scale to domains with visual observations.